Hw14 #5

sme229 · 2024-02-24T22:10:46Z

FASTQ filter function using biopython and task 5

nvaulin · 2024-03-02T11:50:58Z

misc_module

It looks like you have a git repository inside a git repository.

nvaulin · 2024-03-02T11:51:18Z

bio_files_processor.py

Since the file is empty, it can be deleted :)

nvaulin · 2024-03-02T12:00:11Z

biopython_fastq_filter.py

+from Bio import SeqIO
+from Bio.SeqUtils import GC
+
+def filter_fastq(input_path: str, quality_threshold: int, output_filename="final_filtered.fastq",  gc_bounds=(40, 60), length_bounds=(50, 350)):


A docstring is missing here (especially since, as far as I know, you covered them last semester).

nvaulin · 2024-03-02T12:00:34Z

biopython_fastq_filter.py

+    ###GC content filter
+    min_gc_content = gc_bounds[0]
+    max_gc_content = gc_bounds[1]
+    GC_quality_filt = []


Variable names should use lowercase letters only!

nvaulin · 2024-03-02T12:11:10Z

biopython_fastq_filter.py

+    result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq")
+    result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq")


In terms of using Biopython's functionality, everything is great. Unfortunately, the way the code is organized is not so good. The checks are scattered, and after each check you create a new file on disk and write the sequences into it. Between those reads and writes you also hold the full contents of the files in memory. All of this could be organized much more simply, and better too.

First we open the output file for writing. Then we read records from the input file in a for loop. On each iteration we hold a single record in memory, run all the required checks on it, and if everything is OK, write it to the file. No pile of intermediate files is created along the way.

Pseudocode:

with open(output, "w") as outf:
    for seq in SeqIO.parse(input, "fastq"):
        if ok and ok and ok:
            SeqIO.write(seq, outf, "fastq")
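A minimal sketch of this single-pass approach, under some assumptions: the function signature is taken from the diff above, but the exact meaning of each check (mean Phred quality compared with `>=`, GC content in percent, inclusive bounds) is a guess, and the helper names `gc_percent`, `passes_filters` and `passing_records` are made up for illustration. GC content is computed by hand here rather than with `Bio.SeqUtils`.

```python
def gc_percent(sequence: str) -> float:
    """Return GC content as a percentage (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence if base in "GCgc")
    return 100 * gc_count / len(sequence)


def passes_filters(sequence, qualities, quality_threshold, gc_bounds, length_bounds):
    """Run all three checks on one read (assumed semantics, see lead-in)."""
    if not sequence:
        return False  # nothing to average, so an empty read never passes
    mean_quality = sum(qualities) / len(qualities)
    return (
        length_bounds[0] <= len(sequence) <= length_bounds[1]
        and gc_bounds[0] <= gc_percent(sequence) <= gc_bounds[1]
        and mean_quality >= quality_threshold
    )


def passing_records(records, quality_threshold,
                    gc_bounds=(40, 60), length_bounds=(50, 350)):
    """Lazily yield only the records that pass every check."""
    for record in records:
        sequence = str(record.seq)
        qualities = record.letter_annotations["phred_quality"]
        if passes_filters(sequence, qualities, quality_threshold,
                          gc_bounds, length_bounds):
            yield record


def filter_fastq(input_path: str, quality_threshold: int,
                 output_filename="final_filtered.fastq",
                 gc_bounds=(40, 60), length_bounds=(50, 350)) -> int:
    """Filter a FASTQ file in one pass; return the number of reads written."""
    # Imported here so the helpers above can be used without Biopython installed.
    from Bio import SeqIO

    records = SeqIO.parse(input_path, "fastq")
    kept = passing_records(records, quality_threshold, gc_bounds, length_bounds)
    # SeqIO.write pulls records from the generator one at a time,
    # so only a single record is held in memory and no temporary files appear.
    return SeqIO.write(kept, output_filename, "fastq")
```

Passing a generator to `SeqIO.write` plays the same role as the explicit loop in the pseudocode: one output file, one read over the input, and all checks applied to each record before it is written.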

nvaulin · 2024-03-02T12:15:05Z

biopython_fastq_filter.py

+        list_input = list(self.seq)
+        for i in range(len(self.seq)):


There is no need to iterate over indices and then take [i] when you can loop over the letters directly.
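As a rough illustration of looping over the letters directly, here is a standalone function; the name `complement` and the module-level `COMPLEMENT` table are placeholders, not the code from the PR. Letters missing from the table are left unchanged, which mirrors the `if ... in self.complement_dict` check in the diff.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G",
              "a": "t", "t": "a", "g": "c", "c": "g"}


def complement(sequence: str) -> str:
    """Return the complement of a DNA string, keeping unknown letters as-is."""
    # Iterate over the letters themselves: no range(len(...)), no [i] lookups.
    return "".join(COMPLEMENT.get(base, base) for base in sequence)
```

For example, `complement("ATgc")` gives `"TAcg"`.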

nvaulin · 2024-03-02T12:15:53Z

biopython_fastq_filter.py

+        for i in range(len(self.seq)):
+            if list_input[i] in self.complement_dict:
+                list_input[i] = self.complement_dict[list_input[i]]
+        return "".join(list_input)


You started with a DNA object, but after complement you end up with a plain string. To keep the data type unchanged, you can do:

Suggested change

-        return "".join(list_input)
+        return type(self)("".join(list_input))

If this is unclear, we can discuss it further at a consultation.
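To show what `type(self)(...)` does, here is a small sketch. The class names `NucleicAcidSequence` and `DNASequence` are assumptions (the diff shows only a `complement_dict`, a `self.seq` attribute and a call to `super().__init__(seq)`), and the validation that the real class performs on initialization is left out.

```python
class NucleicAcidSequence:
    """Hypothetical base class; the real one in the PR is not shown."""

    complement_dict: dict = {}

    def __init__(self, seq: str):
        self.seq = seq

    def complement(self):
        complemented = "".join(
            self.complement_dict.get(base, base) for base in self.seq
        )
        # type(self) is the class of the object the method was called on,
        # so a DNASequence yields a DNASequence rather than a plain str.
        return type(self)(complemented)


class DNASequence(NucleicAcidSequence):
    complement_dict = {"A": "T", "T": "A", "G": "C", "C": "G",
                       "a": "t", "t": "a", "g": "c", "c": "g"}


dna = DNASequence("ATgc")
result = dna.complement()
print(type(result).__name__, result.seq)
```

Because the result is again a `DNASequence`, calls can be chained: `dna.complement().complement().seq` returns the original string.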

nvaulin · 2024-03-02T12:16:04Z

biopython_fastq_filter.py

+    complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}
+    def __init__(self, seq):
+        super().__init__(seq)
+        #self.complement()


Suggested change

-        #self.complement()

nvaulin · 2024-03-02T12:16:28Z

biopython_fastq_filter.py

+    def __init__(self, seq):
+        super().__init__(seq)


In general you don't need to write this, since the parent's __init__ is used by default anyway.

nvaulin · 2024-03-02T12:17:01Z

biopython_fastq_filter.py

+            if list_input[i] in self.complement_dict:
+                list_input[i] = self.complement_dict[list_input[i]]


And what about the else case? It seems it can never happen if initialization checks that all letters are valid.

lsmertina and others added 23 commits October 7, 2023 10:41

Add fastq filter module

0496a2a

Add functions for protein sequences

5f3ef9f

Add nucleic acid functions from hw3

e022375

Add the main script with 3 functions

fc02f82

Create a script

e44c87b

Create bio_files_processor.py file for HW6

92e6910

add script and sample data

b8cd252

Delete modules directory

abc3bd6

Delete miscellaneous.py

3b1af5c

Delete updated_HW5.py

92774b2

add requirements txt

ca3388d

Create requirements.txt

0fca5dc

delete files

8a853c6

Merge branch 'HW14' of github.com:sme229/misc_module into HW14

3eb47f6

add biopython fastq filter script

8886a58

add an example fastq file

c08ab84

add script for task 5 hw14

d962c09

Update biopython_fastq_filter.py

597f3c0

Update biopython_fastq_filter.py

3b0861e

Update biopython_fastq_filter.py

ca88ffc

Update biopython_fastq_filter.py

840cbb7

Update biopython_fastq_filter.py

75d7629

Update biopython_fastq_filter.py

29ce70b
